Standardizing Data

python

datacamp

machine learning

EDA

Author

kakamana

Published

January 23, 2023

Standardizing Data

Standardizing data is all about making sure that your data fits the assumptions that the model is making about the distribution or amount of features you have. Standardizing your data will help make sure that it fits these assumptions and will ultimately improve your algorithm’s performance.

This Standardizing Data is part of Datacamp course: Preprocessing for Machine Learning in Python

This is my learning experience of data science through DataCamp

Code

import pandas as pd
import numpy as np

Standardizing Data

Standardization
- Preprocessing method used to transform continuous data to make it look normally distributed
- Scikit-learn models assume normally distributed data
  - Log normalization
  - feature Scaling
When to standardize: models
- Model in linear space
- Dataset features have high variance
- Dataset features are continuous and on different scales
- Linearity assumptions

Modeling without normalizing

Let’s take a look at what might happen to your model’s accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you’ll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won’t go too in-depth with it. You already have a k-nearest neighbors model available (knn) as well as the X and y sets you need to fit and score on.

Code

wine = pd.read_csv('dataset/wine_types.csv')
wine.head()

	Type	Alcohol	Malic acid	Ash	Alcalinity of ash	Magnesium	Total phenols	Flavanoids	Nonflavanoid phenols	Proanthocyanins	Color intensity	Hue	OD280/OD315 of diluted wines	Proline
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735

Code

X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine['Type']

Code

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# SCore the model on the test data
print(knn.score(X_test, y_test))

0.7777777777777778

Log normalization

Applies log transformation
Natural log using the constant \(e\) (2.718)
Captures relative changes, the magnitude of change, and keeps everything in the positive space

Checking the variance

Check the variance of the columns in the wine dataset.

Code

wine.describe()

	Type	Alcohol	Malic acid	Ash	Alcalinity of ash	Magnesium	Total phenols	Flavanoids	Nonflavanoid phenols	Proanthocyanins	Color intensity	Hue	OD280/OD315 of diluted wines	Proline
count	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000
mean	1.938202	13.000618	2.336348	2.366517	19.494944	99.741573	2.295112	2.029270	0.361854	1.590899	5.058090	0.957449	2.611685	746.893258
std	0.775035	0.811827	1.117146	0.274344	3.339564	14.282484	0.625851	0.998859	0.124453	0.572359	2.318286	0.228572	0.709990	314.907474
min	1.000000	11.030000	0.740000	1.360000	10.600000	70.000000	0.980000	0.340000	0.130000	0.410000	1.280000	0.480000	1.270000	278.000000
25%	1.000000	12.362500	1.602500	2.210000	17.200000	88.000000	1.742500	1.205000	0.270000	1.250000	3.220000	0.782500	1.937500	500.500000
50%	2.000000	13.050000	1.865000	2.360000	19.500000	98.000000	2.355000	2.135000	0.340000	1.555000	4.690000	0.965000	2.780000	673.500000
75%	3.000000	13.677500	3.082500	2.557500	21.500000	107.000000	2.800000	2.875000	0.437500	1.950000	6.200000	1.120000	3.170000	985.000000
max	3.000000	14.830000	5.800000	3.230000	30.000000	162.000000	3.880000	5.080000	0.660000	3.580000	13.000000	1.710000	4.000000	1680.000000

The Proline column has an extremely high variance.

Log normalization in Python

Now that we know that the Proline column in our wine dataset has a large amount of variance, let’s log normalize it.

Code

# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542436
0.17231366191842012

Scaling data for feature comparison

Features on different scales
Model with linear characteristics
Center features around 0 and transform to unit variance(1)
Transforms to approximately normal distribution

Scaling data - investigating columns

We want to use the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset to train a linear model, but it’s possible that these columns are all measured in different ways, which would bias a linear model. Using describe() to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?

Code

wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()

	Ash	Alcalinity of ash	Magnesium
count	178.000000	178.000000	178.000000
mean	2.366517	19.494944	99.741573
std	0.274344	3.339564	14.282484
min	1.360000	10.600000	70.000000
25%	2.210000	17.200000	88.000000
50%	2.360000	19.500000	98.000000
75%	2.557500	21.500000	107.000000
max	3.230000	30.000000	162.000000

Scaling data - standardizing columns

Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let’s standardize them in a way that allows for use in a linear model.

Code

from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

print(wine_subset.iloc[:3])

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

print(wine_subset_scaled[:3])

    Ash  Alcalinity of ash  Magnesium
0  2.43               15.6        127
1  2.14               11.2        100
2  2.67               18.6        101
[[ 0.23205254 -1.16959318  1.91390522]
 [-0.82799632 -2.49084714  0.01814502]
 [ 1.10933436 -0.2687382   0.08835836]]

Standardized data and modeling

KNN on non-scaled data

Let’s first take a look at the accuracy of a K-nearest neighbors model on the wine dataset without standardizing the data. The knn model as well as the X and y data and labels sets have been created already. Most of this process of creating models in scikit-learn should look familiar to you.

Code

wine = pd.read_csv('dataset/wine_types.csv')

X = wine.drop('Type', axis=1)
y = wine['Type']

knn = KNeighborsClassifier()

Code

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7555555555555555

KNN on scaled data

The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data.

Code

knn = KNeighborsClassifier()

# Create the scaling method
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.9333333333333333